ABSTRACT

Direct assessments of the accuracy with which raters
can use a rating instrument are presented. This study demonstrated
how surplus behavioral incidents scaled during the development of
Behaviorally Anchored Rating Scales (BARS) can be used effectively in
the evaluation of the newly developed scales. Construction of
scenarios of hypothetical incumbent job performance and alternative
rating instruments makes fuller use of behavioral incident item pools
that result from BARS development procedures. Ratee (hypothetical
incumbent) performance levels are known from the scale values of
items chosen to depict ratee performance, and the relative accuracy
with which raters may use newly developed BARS can be evaluated in
comparison with alternative formats developed as part of the
evaluation process. Secondly, the study adds to the literature
concerned with comparisons of rating formats in terms of their
psychometric properties by contrasting the sole effects of rating
format upon the psychometric quality of resulting scales. Again, BARS
was an effective format for the rating of individuals'
performance. Finally, the virtue of rating accuracy as an evaluative
criterion for assessing the psychometric quality of performance
rating scales was extolled. (Author/CM)

The psychometric properties of performance rating scales have been
assessed by the existence of constant rating errors in ratings provided
on these instruments. Typically, these assessments are relative ones in
that two or more rating formats are compared with respect to their relative
psychometric properties (Saal, Downey and Lahey, 1980). Frequently, rating
scales have been evaluated along the following rating error criteria: (1)
Halo - the tendency to base rating judgments on a global impression of a
ratee, or "failure to discriminate among conceptually distinct and potentially
independent aspects of a ratee's behavior" (Saal et al., 1980, p. 415); (2)
Leniency/Severity - the tendency to assign higher or lower ratings than are
warranted by a ratee's performance; and (3) Restriction of Range - truncation
of the distribution of ratings compared to that warranted by actual variability
in ratees' levels of performance. A rating scale that engenders less of each
of these errors, compared to an alternative rating format, was judged to be
psychometrically superior.

Conduct of comparative evaluations among rating scales, however, often
requires that some rather tenuous assumptions be made regarding several
properties of the distribution of "true" levels of employees' performance.
For example, one operational definition of halo error examines the magnitudes
of intercorrelations among ratings assigned to ratees across performance
dimensions. Higher intercorrelations are taken to reflect greater existence
of halo error in the ratings. Note the implicit assumption that employee
performance levels should not be correlated, or that correlations should be
low, across dimensions (conceptually distinct aspects) of his/her job. That is,
there is no consideration of the possibility of the existence of potentially
large amounts of "true halo" (Cooper, 1981), or actual covariation among
employees' levels of performance, a plausible condition given a general
ability factor (Cronbach and Snow, 1977), for instance. High intercorrelations
could also reflect the results of training efforts designed to improve
employee skills on those aspects of his/her job where performance was at one
time relatively deficient. Thus, it would be possible to erroneously declare
one rating scale a "psychometric winner" based on low interdimension
correlations, while more correct assessments of a high degree of true halo in
ratee job performance on an alternative format would lead it to be declared
psychometrically inferior.
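
The "true halo" argument can be illustrated with a minimal Python sketch. The
performance levels below are hypothetical values invented for illustration,
not data from this study. The sketch computes the mean interdimension
correlation, one common halo index, on true levels that happen to covary.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_interdimension_r(levels):
    """Average correlation over all pairs of performance dimensions.

    levels maps a dimension label to a list of values, one per ratee,
    with ratees in the same order for every dimension.
    """
    pairs = list(combinations(levels, 2))
    return sum(pearson(levels[a], levels[b]) for a, b in pairs) / len(pairs)


# Hypothetical TRUE performance levels for five ratees on three dimensions.
true_levels = {
    "A": [2, 3, 4, 5, 6],
    "B": [3, 3, 5, 5, 7],
    "C": [1, 3, 4, 6, 6],
}

# Perfectly accurate ratings would reproduce these levels exactly, so this
# value is the "halo" index that error-free ratings would show.
halo_index = mean_interdimension_r(true_levels)
print(round(halo_index, 2))
```

In this invented example a rater who reproduced every true level without error
would still show a high mean intercorrelation, which a rating error criterion
would read as halo.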

Similar problems exist for operational definitions of leniency/severity
and restriction of range. A common method of assessing leniency involves the
calculation of the third moment about the mean, a measure of skewness. The
null hypothesis, i.e., the condition of no leniency bias, is that this measure
is not significantly different from zero. That is, the assumed underlying true
distribution is approximately normal and has a mean located at or very near
the scale midpoint. In fact, calculated values of skewness are often
significantly negative (cf. Landy, Farr, Saal, & Freytag, 1976). This finding
could reflect a leniency error or bias. On the other hand, this finding could
reflect the success of selection and promotion programs designed to choose
and retain well performing employees. Negatively skewed data could also
reflect the effects of employee self-selection, or termination or withdrawal
of less successful employees. In short, evaluation of leniency error by
assessing degree of skewness in rating data is done against an unknown referent.
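
As a rough illustration of the skewness index, the sketch below computes the
standardized third moment in plain Python. The ratings are hypothetical and
the function name is an assumption made for this example. A negative value
appears whenever most scores sit near the top of the scale, whether the cause
is lenient raters or a genuinely well performing group.

```python
def skewness(values):
    """Standardized third moment about the mean (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical ratings on a 7-point scale, piled up at the high end.
high_heavy = [7, 7, 6, 6, 6, 5, 5, 4, 2]
# A symmetric set of ratings for comparison.
symmetric = [1, 2, 3, 4, 5, 6, 7]

skew_high_heavy = skewness(high_heavy)
skew_symmetric = skewness(symmetric)
print(skew_high_heavy < 0, abs(skew_symmetric) < 1e-12)
```

The statistic alone cannot say which explanation holds; that requires knowing
the true distribution of performance.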

The same problem is encountered for a restriction of range criterion.
Measures of range restriction (fourth moment about the mean or standard deviation
of ratings across ratees within performance dimensions) could also reflect
rating error or the results of organizational influences, e.g., performance
norms or ceiling or floor effects. With a rating error criterion correct
assignment of cause is impossible.

A final, somewhat related problem exists with another popular evaluative
criterion - reliability of ratings. Although several appropriate means for
assessing the reliability of a rating instrument exist, the Intraclass
Correlation (ICC) (Shrout and Fleiss, 1979) is probably the most popular (Saal
et al., 1980). Although an appropriate statistic for the assessment of
interrater agreement, any method of calculating the ICC (Shrout and Fleiss, 1979)
can seriously underestimate true interrater agreement if there exists low
variation among ratees' performance levels (James, Wolf & Demaree, Note 1).
Thus, if ratees' performance levels are unknown, so is the accuracy of an ICC
estimate of interrater agreement.
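
The dependence of the ICC on between-ratee variation can be seen in a small
sketch. It assumes the one-way random effects form, ICC(1,1) in Shrout and
Fleiss's notation, and uses invented ratings; other ICC forms differ in
detail but share the same sensitivity.

```python
def icc_one_way(table):
    """ICC(1,1) from a ratees-by-raters table (one-way random effects)."""
    n = len(table)      # number of ratees
    k = len(table[0])   # number of raters per ratee
    grand = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    # Between-ratee and within-ratee mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(table, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical data: three raters disagree by the same amount (+/- 1 point)
# in both tables; only the spread of ratee performance differs.
varied_ratees = [[1, 2, 3], [3, 4, 5], [5, 6, 7]]
similar_ratees = [[3, 4, 5], [3, 4, 5], [3, 4, 5]]

icc_varied = icc_one_way(varied_ratees)
icc_similar = icc_one_way(similar_ratees)
print(round(icc_varied, 3), icc_similar)
```

Rater disagreement is identical in the two tables, yet the estimate falls
sharply when the ratees do not differ from one another.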

In summary, the psychometric quality of rating instruments, implicitly,
the accuracy with which raters can use rating instruments, has been inferred
from the nonexistence of deviations from some characteristic of an assumed
true distribution of employee performance. Since, as a general rule, the
population parameters of such a distribution are not known, evaluations or
comparative evaluations of scales have been largely conducted with unknown
criteria. In this body of literature, Behaviorally Anchored Rating Scales
(BARS; Smith and Kendall, 1963) have received intensive study (Jacobs, Kafry
& Zedeck, 1980; Kingstrom and Bass, 1981; Landy & Farr, 1980; Schwab, Heneman
& DeCotiis, 1975). Generally, with rating error criteria, BARS have not
yielded psychometrically better quality ratings compared to other, often simpler
and less expensively developed rating formats (Kingstrom and Bass, 1981; Schwab
et al., 1975). Note, however, that this conclusion is reached from research
literature that has compared scales in terms of the relative degree to which
rating formats engender rating errors, and thus must be tempered with
realization of the ambiguities involved in this mode of comparative evaluation
as outlined above.

An alternative set of procedures exists for the evaluation of the psycho- 
metric integrity of newly developed rating scales. These are particularly 
appropriate to the development and evaluation of BARS. The intent of these 
evaluative methods is to provide (potential) raters with targets (ratees) 
whose performance effectiveness parameters are identifiable. As will be 
shown, this is accomplished via the use of scaled behavioral incidents obtained
in the process of development of BARS. 

The development of BARS involves six general steps: (1) rational definition
of performance dimensions; (2) generation of critical incidents of job
performance; (3) editing of critical incidents into the form of behavioral
expectations; (4) item scaling and "retranslation" of behavioral expectations
(Smith and Kendall, 1963); (5) item selection; and (6) final formatting of BARS
(see Schwab et al., 1975 for an excellent summary of these procedures). The
resulting products of the first four of these steps are a set of rationally
defined and consensually unambiguous dimensions of incumbent job performance,
along with behavioral expectations scaled as to the degree of performance
effectiveness represented on a particular job dimension. A behavioral
expectation item is eliminated from consideration as a behavioral anchor for
BARS if a criterion percent of judges do not agree as to which performance
dimension it represents (retranslation criterion) and/or if judges do not
agree upon what level of performance effectiveness is represented by the item
(standard deviation criterion, DeCotiis, 1978). Once the initial set of
behavioral incidents is purged of items thus judged ambiguous, one is left
with a pool, now reduced in number, of items deemed suitably unambiguous to
qualify as a scale behavioral anchor.

Generally, a larger pool of suitable items results than is required to
sufficiently anchor BARS' scales (Zedeck, Jacobs and Kafry, 1976). Writers
have recommended using these surplus items to construct parallel forms of
BARS (Zedeck, Jacobs and Kafry, 1976) or alternative rating formats (Zedeck,
Kafry and Jacobs, 1976). Still others have used these additional items to
construct vignettes or scenarios of hypothetical incumbent job performance
(DeCotiis, 1977; Sauser, 1979). It is in this final application of these surplus
items that the potential for alternative scale evaluation procedures lies.

Incorporation of scaled critical incidents in a narrative description of
employee performance provides a means by which levels of employees' performance
effectiveness can be specified, i.e., by the scale values chosen to depict
performance. While relative degree of rating accuracy has been inferred from 
the relative absence of traditional errors in rating data, when true performance 
levels are known, rating accuracy can be assessed directly by a simple metric: 
deviation of rating from true score, or performance level as depicted. Thus, 
direct assessments of the accuracy with which raters can use a rating instrument 
are made available. 

One purpose of the present study was illustrative: to demonstrate possible
extended uses of surplus, psychometrically acceptable behavioral expectation 
items. Another secondary purpose was to evaluate an alternative rating format 
according to its psychometric efficacy in comparison with a BARS. Primarily, 
this study evaluated the usefulness of an alternative criterion for the 
evaluation of the psychometric properties of rating scales: direct assessment 
of rating accuracy. 

METHOD
Construction of Experimental Materials 

As a result of prior interview and task inventory approaches to job
analysis of a set of selected secretarial jobs, eleven dimensions of
secretarial job performance were rationally defined (York, Note 2). Next, positive
and negative critical incidents of job performance were solicited from a sample
of secretarial incumbents. Obtained incidents were edited to yield 500 brief
evaluative statements relevant to secretarial job performance. During editing,
care was taken to preserve as much of incumbents' own language and terminology
as possible. These 500 statements were randomly assigned to, and randomly
arranged within, five sets of 100 statements each. Another sample of secretaries
(n = 100) then participated in item scaling by Thurstone's Method of
Equal-Appearing Intervals (Edwards, 1957) and retranslation of statements to
performance dimensions as defined. Each subject was randomly assigned one
booklet containing 100 statements. Subjects rated each statement on a 7-point
scale as to the level of performance exemplified, and allocated each statement
to one of the rationally defined performance dimensions. The following criteria
were adopted to assure that only unambiguous behavioral items would be retained
as potential BARS anchors. An item was retained: (1) if it had a Q-value of
less than 1.90 (Edwards, 1957); and (2) if it was allocated to one performance
dimension by at least 67% of the respondents, with the constraint that no more
than 20% of the other responses fell into any other one category. A total of
208 items met these criteria. An insufficient number of items were retranslated
to two performance dimensions to adequately anchor scales for them; thus these
were eliminated from further consideration in this study. The remaining nine
performance dimensions are listed in Table 1.
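
The two retention criteria can be sketched in Python as follows. This is an
illustration under stated assumptions, not the procedure actually run in the
study: the Q-value is taken as the interquartile range of judges' scale
placements and is computed from raw ratings with the standard library rather
than by Edwards's grouped-data interpolation, the 20% limit is read as a share
of all responses, and the function name, dimension labels, and judge data are
hypothetical.

```python
from collections import Counter
from statistics import median, quantiles


def screen_item(scale_ratings, allocations,
                max_q=1.90, min_agree=0.67, max_other=0.20):
    """Apply the Q-value and retranslation criteria to one item."""
    q1, _, q3 = quantiles(scale_ratings, n=4, method="inclusive")
    q_value = q3 - q1  # interquartile range of judged effectiveness
    counts = Counter(allocations).most_common()
    total = len(allocations)
    top_dimension, top_count = counts[0]
    top_share = top_count / total
    # Largest share given to any single competing dimension.
    other_share = counts[1][1] / total if len(counts) > 1 else 0.0
    retained = (q_value < max_q
                and top_share >= min_agree
                and other_share <= max_other)
    return {
        "scale_value": median(scale_ratings),
        "q_value": q_value,
        "dimension": top_dimension,
        "retained": retained,
    }


# Hypothetical judgments from ten judges for two items.
clear_item = screen_item([6, 6, 7, 6, 5, 6, 7, 6, 6, 5],
                         ["Typing"] * 8 + ["Filing"] * 2)
vague_item = screen_item([1, 2, 4, 6, 7, 3, 5, 2, 6, 7],
                         ["Typing"] * 5 + ["Filing"] * 5)
print(clear_item["retained"], vague_item["retained"])
```

The retained item's median placement serves as its scale value, which is the
quantity later used to fix the performance level depicted in a scenario.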

Behaviorally Anchored Rating Scales. Between four and six behavioral items
with larger percentage retranslations and smaller Q-values were selected to
anchor scales for each performance dimension at points as evenly spaced along
the scale as possible. The final BARS included the name of each performance
dimension, a definition of that dimension, a vertical 7-point graphic scale
anchored numerically and adjectivally, and the behavioral anchors located along
the sides of the graphic scale with an arrow pointing to a point near that
corresponding to the anchor's scale value.

Weighted Checklist. Items for an alternative Weighted Checklist (WCL) rating
format were selected from the common item pool and according to the same judgmental
criteria as for the BARS. Five items for each of the nine performance dimensions
were selected whose scale values represented as broadly and as evenly as possible
the entire range of the 7-point scale. The final format included only the 45
items arranged randomly, following instructions for scale use. Use of the
scale involved raters endorsing those items judged to be descriptive of a ratee's
typical job performance.

Scenarios of Secretarial Job Behavior. Again, from the same common item
pool, two items each were selected to describe hypothetical incumbent
performance on two different scenarios. These two items reflected nearly the
same level of performance within each job performance dimension (i.e., nearly
equal scale values), either a high or low performance level on one scenario.
Two items from the opposite end of the scale continuum were selected for the
other scenario. This arrangement is depicted in Table 2. The selected items were
randomly arranged and formatted to follow a brief description of a hypothetical
secretary arbitrarily named either "Cathy" or "Debra".
Procedure 

Seventy-five secretaries were randomly assigned to one of two groups. 
Subjects in each group rated the performance of one of the hypothetical incumbents 
on both rating formats. All correspondence was conducted by mail and complete 
confidentiality of responses to the investigators was assured. 

RESULTS

Rating Errors

Halo. Contradictory results were obtained regarding format superiority
under different operational definitions of halo employed in the literature
(Saal et al., 1980). First, intercorrelations of ratings across job
performance dimensions were significantly larger for the WCL than for the BARS,
indicating less halo error on the BARS. Second, Principal Components Analyses
(Mulaik, 1972) yielded two components that accounted for approximately 75% of
the rating variance for each format/scenario combination, indicating no
difference in halo error between the two rating formats. Third, standard deviations
of each rater's ratings of one scenario across performance dimensions were
significantly larger for the WCL than for the BARS, indicating less halo
error in ratings obtained on the WCL format. Thus, the first and third commonly
used operational measures of halo error suggested opposite conclusions regarding
the relative existence of halo error on the two rating formats.
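
The third halo index, each rater's standard deviation across dimensions, is
simple to compute. The sketch below uses two hypothetical raters judging a
scenario whose depicted levels alternate between low and high; the numeric
levels 2 and 6 and the rating vectors are assumptions for illustration only.

```python
from statistics import stdev

# Depicted levels for one scenario across nine dimensions (L, H, L, H, H,
# L, H, L, H), with hypothetical scale values of 2 for low and 6 for high.
depicted = [2, 6, 2, 6, 6, 2, 6, 2, 6]

# A rater who tracks the depicted levels exactly.
discriminating_rater = list(depicted)
# A rater who rates near a global impression and barely differentiates.
global_impression_rater = [4, 5, 4, 5, 5, 4, 5, 4, 5]

sd_discriminating = stdev(discriminating_rater)
sd_global = stdev(global_impression_rater)
print(sd_discriminating > sd_global)
```

A small within-rater standard deviation is read as halo only because the
depicted levels are known to vary across dimensions; with a ratee whose true
performance is uniform, the same small value would reflect accuracy.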

Leniency/Severity. One statistical test often employed to assess the
existence of leniency or severity error is to test for a difference between the
mean of obtained ratings and the scale midpoint, the theoretical population mean
(Saal et al., 1980). Of 18 such comparisons (two scenarios by nine performance
dimensions each), 13 mean BARS ratings differed significantly from the scale
midpoint and ten significant differences were obtained for the WCL. Without
exception, the differences in means from the scale midpoints were in the
direction of depicted performance level (i.e., high or low performance
effectiveness). Without the knowledge of actual performance levels, however, one
might conclude that both rating formats engendered wild variations in ratings
across performance dimensions. Similar results were obtained for another
operational definition of leniency error, the third moment about the mean
(skewness). Though negative in the majority, both significantly positive and
negative values of skewness were obtained for both rating formats. These findings
parallel the ones above and could lead one to conclude that either raters cannot
use scales appropriately or that somehow the scale midpoints of various
dimensions' scales have been mislocated.
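
A minimal sketch of the midpoint comparison follows, using a one-sample t
statistic. The ratings are hypothetical, the midpoint of 4 assumes a 7-point
scale scored 1 to 7, and the text does not state which test statistic was
used, so the choice of a t statistic is an assumption.

```python
from math import sqrt
from statistics import mean, stdev


def t_against_midpoint(ratings, midpoint=4.0):
    """One-sample t statistic for the mean rating versus the scale midpoint."""
    standard_error = stdev(ratings) / sqrt(len(ratings))
    return (mean(ratings) - midpoint) / standard_error


# Hypothetical ratings of a dimension on which HIGH performance was depicted.
high_dimension_ratings = [6, 6, 7, 5, 6, 7, 6, 5]
t_high = t_against_midpoint(high_dimension_ratings)
print(round(t_high, 2))
```

A large positive value here would be labeled leniency under a rating error
criterion, although the raters are simply reporting the high performance that
was depicted.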

Restriction of Range. Even greater confusion was generated with two
different statistical measures of range restriction. First, standard deviations
of ratings within performance dimensions (i.e., across raters) indicated that
BARS almost invariably exhibited greater restriction of range (i.e., smaller
standard deviations) than the WCL, but only on those performance dimensions
wherein low performance effectiveness was depicted. Where high performance was
depicted there were no differences. An entirely different picture was painted
with a second operational definition of range restriction, the fourth moment
about the mean (kurtosis). Significantly positive kurtosis values indicate a
distribution more widely dispersed than normal (platykurtosis) and negative
values reflect narrower dispersion (leptokurtosis). No values of kurtosis were
significantly different from zero for the BARS data. For the WCL data, on the
other hand, in six of nine instances where high performance was depicted the
data were significantly platykurtic and in eight of nine instances where low
performance was depicted the data were significantly leptokurtic. Thus, these
two sets of results are not even remotely convergent.
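
For reference, a kurtosis index can be computed as below. Sign conventions and
labels for kurtosis differ across sources and software, so this sketch states
its own: it uses the common excess form (standardized fourth moment minus 3),
under which a flat distribution yields a negative value and a peaked,
heavy-tailed one a positive value. The data are hypothetical.

```python
def excess_kurtosis(values):
    """Standardized fourth moment about the mean, minus 3 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3


# Hypothetical ratings spread evenly over a 7-point scale.
flat_ratings = [1, 2, 3, 4, 5, 6, 7]
# Hypothetical ratings concentrated at one value with two extreme scores.
peaked_ratings = [4, 4, 4, 4, 4, 1, 7]

kurt_flat = excess_kurtosis(flat_ratings)
kurt_peaked = excess_kurtosis(peaked_ratings)
print(kurt_flat, kurt_peaked)
```

Whatever convention is adopted, the index should be checked against the one
used by the statistical package at hand before values are interpreted as
evidence of range restriction.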

Evaluation of Rating Error Criteria. Clearly, the two ratees (scenarios)
in the present study are not apt to be representative of secretaries'
configurations of their job performance effectiveness across various aspects
(dimensions) of their job. Secretaries are unlikely to perform excellently on
exactly half of their job duties and extremely poorly on the other half. The
data presented here, however, illustrate an extreme case of what can happen when
rating error criteria are used to evaluate rating scales when there is variation
either among ratees or within ratees across dimensions of job performance. The
above results demonstrate the equivocalness of the conclusions that might be
drawn from such an evaluation study. Either rating format could have been
implicated as engendering greater halo error depending upon the statistical
definition chosen, and virtually no clear conclusions could be drawn from
assessments of relative range restriction or leniency bias, although ratings on
the WCL were somewhat more greatly elevated overall. In general, rating error
criteria are not recommended.
Accuracy 

As noted above, in the construction of scenarios of secretarial job
performance, it was possible to determine a priori the level of job performance
effectiveness that would be depicted in each scenario by choosing scaled
behavioral incidents, similar in scale value, to represent high or low
performance levels across job dimensions. Thus, as depicted, performance
effectiveness levels of ratees across performance dimensions were known. Given
this, assessment of the accuracy with which raters can use alternative rating
scales is straightforward. One need only quantify deviations of ratings from
known performance levels. Measures of rating inaccuracy were calculated for both
formats on each performance dimension as average squared deviations from true
(known) performance levels:

$$Acc_j = \frac{1}{n} \sum_{i=1}^{n} \left( X_{ij} - T_j \right)^2$$

where $Acc_j$ is the mean rating inaccuracy for the $j$th performance dimension,
$X_{ij}$ is the $i$th rater's rating of ratee performance on dimension $j$,
$T_j$ is the performance level depicted on that dimension, and $n$ is the
number of raters. Results of Wilcoxon signed-rank tests for differences between
formats are presented in Table 3. Recall
that what is presented are median rating inaccuracy scores: smaller values
indicate more accurate ratings. Note that there are few differences between
formats on those dimensions wherein high performance was depicted. This is
explained by the generally elevated ratings on the WCL. Overwhelmingly, however,
raters provided more accurate evaluations of poor performance on the BARS
format. This is particularly important in light of the general finding of
elevation of ratings (Kingstrom and Bass, 1981). An accepted finding is that
ratings completed for administrative purposes are more lenient than ratings
completed anonymously for research purposes (Landy and Farr, 1980). Motivational
factors aside, the present results suggest that raters can rate poor performance
more accurately on a BARS than on an alternative WCL rating format. That is,
raters were better able to assign an accurately discriminated evaluation of
poorer performance on the BARS, while evaluations of more effective performance
were approximately equally assigned on both the BARS and WCL. Thus, the
comparative criterion of rating inaccuracy, a measure that directly assesses
validity of ratings, suggested BARS as the superior rating format.
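
The inaccuracy metric described in this section, the average squared deviation
of ratings from the depicted level, can be sketched in a few lines of Python.
The ratings and the depicted level below are hypothetical and are not the
study's data; the function name is likewise an assumption.

```python
def rating_inaccuracy(ratings, depicted_level):
    """Mean squared deviation of raters' ratings from the depicted level."""
    return sum((x - depicted_level) ** 2 for x in ratings) / len(ratings)


# Hypothetical ratings from five raters of a dimension on which LOW
# performance (scale value 2.0) was depicted.
depicted_low = 2.0
bars_ratings = [2, 3, 2, 1, 3]
wcl_ratings = [4, 5, 3, 4, 5]

bars_inaccuracy = rating_inaccuracy(bars_ratings, depicted_low)
wcl_inaccuracy = rating_inaccuracy(wcl_ratings, depicted_low)
print(bars_inaccuracy, wcl_inaccuracy)
```

Because the deviations are squared, elevated ratings of poor performance are
penalized heavily, which is the pattern that separates the two formats in the
invented example.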

Reliability

In addition to validity, reliability represents another important criterion
for criteria. Note that reliability is a necessary but not sufficient condition
for valid ratings, and conversely, validity is sufficient but not necessary for
reliability. The present experimental design readily lends itself to assessment
of interrater reliability via the Intraclass Correlation (ICC). As mentioned
earlier, lack of between-ratee differences in actual performance can lead to
erroneous estimates of interrater reliability. The application of the ICC as
an index of interrater reliability in the present case is appropriate since, as
described above, scenarios were constructed to reflect varying degrees of
effectiveness of performance across performance dimensions. As shown in Table
4, ratings were generally more reliable on the BARS than on the WCL (median
ICC's = .42 and .22 respectively).

DISCUSSION

In line with stated objectives, the present study addressed three main
issues. First, it was demonstrated how surplus behavioral incidents scaled
during the development of BARS can be used effectively in the evaluation of
the newly developed scales. Construction of scenarios of hypothetical incumbent
job performance and alternative rating instruments makes fuller use of
behavioral incident item pools that result from BARS development procedures.
Ratee (hypothetical incumbent) performance levels are known from the scale
values of items chosen to depict ratee performance, and the relative accuracy
with which raters may use newly developed BARS can be evaluated in comparison
with alternative formats developed as part of the evaluation process.

Secondly, the present study adds to the already large body of literature
concerned with comparisons of rating formats in terms of their psychometric
properties. In the past, researchers have often confounded rating format,
developmental procedures and job performance domain surveyed by the rating
scale in their comparisons among instruments (cf. Borman and Dunnette, 1975;
Burnaska and Hollman, 1974; DeCotiis, 1977). The present study, as have some
others (cf. Zedeck, Kafry and Jacobs, 1976), contrasted the sole effects of
rating format upon the psychometric quality of resulting scales. Again, BARS
was supported as an effective format for the rating of individuals' performance.

Finally, the virtue of rating accuracy as an evaluative criterion for
assessing the psychometric quality of performance rating scales was extolled.
The use of the metric of rating inaccuracy described here, however, assumes that
some more objective measure of the quality being rated is available. The
idea of using standardized stimuli as ratees is not new. DeCotiis (1977) and
Sauser (1979) both constructed standardized scenarios of incumbent job
performance much in the manner of the present study. Also, Borman (1979)
used videotaped job performance as the rating stimulus. The recommendation
here is for more routine use of standardized stimuli such as vignettes of
performance for the evaluation of rating instruments. Also, the procedures
recommended here are not limited to construction of BARS, although they are
most applicable here. Similar procedures could be adapted for Likert-type
scale development.

The primary advantage of evaluating performance rating instruments in
terms of the accuracy with which raters can use the scales lies in the
directness of the approach. As discussed above, rating error criteria are
attempts to quantify deviations from accurate ratings indirectly. On the other
hand, a metric of deviations of rated values from true performance scores such
as the one utilized in the present study directly assesses rating accuracy.

One obvious limitation to the approach advocated here is its generalizability
to actual use, for example, the rating of real individuals active in ongoing
organizational activities. That is, results from procedures such as those
outlined in this paper may not be strongly externally valid. On the other
hand, these procedures may represent the strongest instance of internal
validity. Rating of vignettes of hypothetical incumbent job performance may be
conditions conducive to the most accurate possible use of newly developed rating
instruments. These ideal conditions will simply not exist in the "real world".

The complete evaluation of newly developed rating instruments may
inevitably require a two-step process. Primarily one may wish to assess the
accuracy with which raters can evaluate ratee performance with the use of
standardized stimuli, such as job performance scenarios. Secondarily, pilot
testing of scales with supervisory ratings of subordinate performance, using
rating error criteria, would hopefully reinforce the conclusions from the prior
analysis. In such a secondary analysis, however, the researcher need be aware of
the assumptions necessarily made with rating error criteria, and interpret the
results of such an evaluation study in an appropriately circumspect manner.
So done, these two approaches to the evaluation of the psychometric properties
of performance evaluation instruments can be complementary.

Table 1

Secretarial Job Performance Dimensions

A. Bookkeeping and Financial
B. Composing or Editing
C. Filing, Sorting, Routing, etc.
D. Gathering Information
E. Handling Materials
F. Communications and Public Relations
G. Operating or Maintaining Machines
H. Supervising, Directing, Deciding
I. Typing or Data Entry

Table 2

Scenarios of Secretarial Job Performance*

             Performance Dimension
Scenario     A   B   C   D   E   F   G   H   I

"Cathy"      L   H   L   H   H   L   H   L   H
"Debra"      H   L   H   L   L   H   L   H   L

*Performance effectiveness depicted is either High (H) or Low (L).

Table 3

Comparisons Among Formats: Median Rating Inaccuracy Scores

            Scenario 1                            Scenario 2
Perf.                                  Perf.
Dimen.(1)   BARS    WCL     T(2)       Dimen.(1)   BARS    WCL     T(2)

A(L)        1.50    3.82    197**      A(H)        2.28    2.95    179.5***
B(H)         .49    1.10    264.5      B(L)        1.98    4.98    108***
C(L)        2.76    1.61    268        C(H)         .21     .58    250
D(H)        1.80    2.53    266        D(L)        1.28    9.49    111.5***
E(H)        1.40    2.19    338        E(L)        4.20   21.52    47***
F(L)        2.48    1.88    282.5      F(H)         .36    1.17    279
G(H)         .57    1.99    197.4**    G(L)        2.24   15.55    115***
H(L)         .48    3.03    101***     H(H)        3.20     .79    176.5***
I(H)         .55    2.37    262        I(L)        2.02   11.70    127***

(1) Depicted performance is either high (H) or low (L)
(2) Wilcoxon T statistic for Rank-Sign test
**  p less than .02
*** p less than .01

Table 4

Intraclass Correlations

Performance
Dimension      BARS    WCL

A              .425    .485
B              .471    .234
C              .485    .479
D              .242    .000
E              .180    .000
F              .419    .513
G              .346    .182
H              .460    .363
I              .320    .178